DOCUMENT RESUME

ED 088 288                                                    FL 005 220

AUTHOR       Lea, Wayne A.; And Others
TITLE        Use of Syntactic Segmentation and Stressed Syllable
             Location in Phonemic Recognition.
PUB DATE     Nov 72
NOTE         12p.; Paper presented at the Meeting of the
             Acoustical Society of America (84th, Miami, Florida,
             November 1972)

EDRS PRICE   MF-$0.75 HC-$1.50
DESCRIPTORS  Articulation (Speech); Computers; Consonants;
             *Distinctive Features; Graphs; Language Patterns;
             *Language Research; Oral Expression; Pattern
             Recognition; Phonemes; *Phonemics; *Phonological
             Units; Phonology; *Phrase Structure; Sentence
             Structure; Syllables; Syntax; Vowels; Word
             Recognition

ABSTRACT 

Automatic speech recognition is expected to be more
successful when syntactically-related information is incorporated
into early stages of recognition. Phonemic decisions, in particular,
are expected to be more accurate and less ambiguous when contextual
information is considered. A computer program detected about 90% of
all boundaries between major syntactic constituents from fall-rise
fundamental frequency (F0) contours in the Rainbow Script as read by
two talkers. A procedure was devised for locating stressed syllables,
in constituents, from (1) high-energy and increasing-F0 portions near
the peaks of F0 contours and (2) local increases in F0 from an
archetype falling contour. The procedure succeeded in locating most
syllables which had been perceived by listeners to be stressed. A
recognition strategy is outlined for using detected syntactic
boundaries and stressed syllable locations in estimating distinctive
features of some phonemes in connected speech. Vowel and consonant
recognition would be attempted first in the stressed syllables. Other
readily-detected segments, such as coronal strident fricatives, would
be found. A pilot study showed that front/back decisions for vowels
are more reliable in stressed than unstressed syllables.
(Author/DD) 



USE OF SYNTACTIC SEGMENTATION AND STRESSED SYLLABLE LOCATION

IN PHONEMIC RECOGNITION



Wayne A. Lea
Mark F. Medress
Toby E. Skinner



Univac DSD
P.O. Box 3525 
St. Paul, Minn. 55165 



ABSTRACT 



Automatic speech recognition is expected to be more successful when
syntactically-related information is incorporated into early stages of
recognition. Phonemic decisions, in particular, are expected to be more
accurate and less ambiguous when contextual information is considered. A
computer program detected about 90% of all boundaries between major syntactic
constituents, from fall-rise fundamental frequency (F0) contours in the
Rainbow Script, as read by two talkers. A procedure was devised for locating
stressed syllables, in constituents, from (a) high-energy and increasing-F0
portions near the peaks of F0 contours and (b) local increases in F0 from an
archetype falling contour. The procedure succeeded in locating most syllables
(83% for one talker, 98% for another) which had been perceived by listeners
to be stressed. A recognition strategy is outlined for using detected
syntactic boundaries and stressed syllable locations in estimating distinctive
features of some phonemes in connected speech. Vowel and consonant
recognition would be attempted first in the stressed syllables. Other
readily-detected segments, such as coronal strident fricatives, would be
found. A pilot study showed that front/back decisions for vowels are, indeed,
more reliable in stressed than in unstressed syllables.



Presented at the 84th meeting of the Acoustical Society of America, Miami,
Florida, November, 1972.



This research was supported in part by the Advanced Research Projects Agency 
of the Department of Defense under Contract DAHC15-72-C-0138, and in part by 
the Univac Independent Research and Development Program.



USE OF SYNTACTIC SEGMENTATION AND STRESSED SYLLABLE LOCATION

IN PHONEMIC RECOGNITION



Wayne A. Lea, Mark F. Medress, and Toby E. Skinner

1. Structural Aids to Phonemic Recognition

Linguistic and perceptual arguments (Lea, 1972) suggest that devices
which recognize speech will have to make use of grammatical structure in
early stages of the recognition procedures. This can be accomplished, in
part, by using the prosodic features to segment the speech into grammatical
phrases, and to identify those syllables that are given prominence, or stress,
in the sentence structure.

One way in which prosodic information, and the resulting syntactic
segmentation and stress pattern analyses, may be used to aid distinctive
features estimation is as follows. At an early stage in recognition, one
detects boundaries between major syntactic constituents from prosodic
features. Then, the highest-stress syllables within each constituent are
located, using reliable prosodic cues to stress. Some distinctive features are
then estimated within these stressed syllables, since the consonants and
vowels are expected to be more reliably encoded in stressed syllables than in
weakly stressed or reduced syllables (Hughes, Li, and Snow, 1972). Next, the
partial distinctive features description is matched with generated or stored
patterns for possible stressed syllables or words in the lexicon. Then a
guess as to the word content of the constituent is made, based on the
reliable feature information from the stressed syllables, plus other reliable
data within the constituent (such as presence of coronal, strident fricatives,
etc.; cf. Medress, 1972). If reliable decisions cannot be made based on such
minimal feature information within the constituent, analyses are then applied
to other words or syllables at lower stress values, and a guess based on the
two or more moderately-stressed syllables is made. Iteration would continue
until all syllables are analyzed, if necessary. Each iterative guess as to
constituent identity would be combined with those for other constituents in
the sentence until a satisfactory set of hypotheses for all constituents
yielded the grammatical, meaningful sentence.
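
The iterative strategy just outlined can be illustrated with a small Python
sketch. It is only a sketch under stated assumptions: the Syllable record, the
toy lexicon keyed by constituent text, the feature labels, and the rule "stop
when exactly one lexicon entry remains" are all hypothetical stand-ins
introduced here, not components of the system described in this paper.

```python
from dataclasses import dataclass

@dataclass
class Syllable:
    feature: str   # hypothetical partial distinctive-feature estimate
    stress: int    # stress level; higher means more prominent

# Hypothetical toy lexicon: constituent text -> features it may contain.
TOY_LEXICON = {
    "the sunlight": {"back-vowel", "strident"},
    "the rainbow": {"front-vowel", "back-vowel"},
}

def guess_constituent(syllables, lexicon):
    """Analyze syllables in order of decreasing stress, stopping as soon
    as the accumulated evidence is compatible with one lexicon entry."""
    ordered = sorted(syllables, key=lambda s: s.stress, reverse=True)
    evidence = []
    candidates = list(lexicon)
    for syl in ordered:
        evidence.append(syl.feature)
        candidates = [w for w in lexicon
                      if all(f in lexicon[w] for f in evidence)]
        if len(candidates) == 1:   # a reliable decision has been reached
            break
    return candidates, len(evidence)

# One stressed syllable suffices here; lower-stress syllables are not used.
print(guess_constituent(
    [Syllable("back-vowel", 1), Syllable("front-vowel", 3)], TOY_LEXICON))
```

When the most stressed syllable leaves more than one candidate, the loop
falls through to the next syllable in stress order, which mirrors the
fallback to lower stress values described above.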

A computer program was previously implemented for detecting boundaries
between major grammatical constituents from fall-rise "valleys" in
fundamental frequency (F0) contours (Lea, 1972). It has successfully detected
80 to 90% of all predicted boundaries, for weather reports, newscasts, and
stories composed of monosyllabic words, as read by six talkers.
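
As a rough illustration of what detecting a fall-rise "valley" involves, the
following Python sketch scans a sampled F0 contour for a sufficiently large
fall followed by a sufficiently large rise. The function name, the fractional
threshold `min_change`, and the treatment of the contour as a gap-free list of
values are assumptions made for this example; the criteria of the actual
program are given in Lea (1972) and are not reproduced here.

```python
def fall_rise_valleys(f0, min_change=0.07):
    """Return indices of F0 valleys: points where the contour has fallen
    by at least `min_change` (a fraction) from the preceding peak and
    then rises by at least `min_change` above the valley value.
    `min_change` is a hypothetical threshold chosen for illustration."""
    boundaries = []
    peak = f0[0]          # highest value seen since the last boundary
    valley_idx = None     # candidate valley, set after a large enough fall
    for i in range(1, len(f0)):
        x = f0[i]
        if valley_idx is None:
            if x > peak:
                peak = x
            elif x <= peak * (1 - min_change):
                valley_idx = i                  # the fall is large enough
        else:
            if x < f0[valley_idx]:
                valley_idx = i                  # contour is still falling
            elif x >= f0[valley_idx] * (1 + min_change):
                boundaries.append(valley_idx)   # the rise confirms the valley
                peak, valley_idx = x, None
    return boundaries

# Hypothetical contour in Hz with two fall-rise valleys.
contour = [120, 130, 125, 110, 100, 105, 115, 125, 118, 100, 95, 110]
print(fall_rise_valleys(contour))   # [4, 10]
```

A contour that only falls, such as an utterance-final fall, yields no
boundary under this rule because no confirming rise follows.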

On the other hand, no algorithm for locating stressed syllables in
connected speech has previously been developed. The use of boundary
detections and stressed syllable locations in aiding distinctive features
estimation has also not been tested. One problem in testing a stressed
syllable locator is to first establish what are properly considered as the
actual stressed syllables in the speech.

Experiments have been designed to study syntactic boundaries and stress
patterns in the first paragraph of the well-known "Rainbow Passage" (Fairbanks,
1940; Lea, Medress, and Skinner, 1972). Perception tests were conducted to
provide the standard stress decisions whereby acoustic cues to stress could be
tested.

2. Perceived Stress Patterns in the Rainbow Script

Three listeners (WAL, MFM, and TES) individually heard the Rainbow
Script as recorded by two male talkers (ASH and GWH). Following a
modification (cf. Lea, Medress, and Skinner, 1972) of the procedure used by
Hughes, Li, and Snow (1972), each listener heard clauses or sentences in the
Rainbow Script repeated at will (by rewinding and replaying a tape). The
listener was instructed to mark (in whatever way he chose), for each syllable,
whether he heard that syllable as stressed, unstressed, or reduced. To
facilitate marking for each syllable, the script was typed on a sheet of paper
with vertical slashes between syllables. A mark was required for each
syllable (between two slash marks). The listener received one such sheet
for each talker.

Each listener repeated the experiment at least twice, to establish
listener consistency from one time to another. Figure 1 shows the resulting
"confusion" matrices of perceptions from one trial to the next. Differences
from talker to talker were slight, so that data for both are pooled in
Figure 1. In general, the preponderance of judgments were on the diagonal,
showing that perceived stress levels were consistent from repetition to
repetition. A third trial by listener WAL again showed similar results.
Listeners WAL and MFM reported that they categorized as follows: "Is the
syllable stressed? If not, is it reduced? If not (that is, for the
left-overs), it is unstressed". Listener TES used an alternate strategy
whereby reduced syllables were the left-over category. For all listeners,
confusions between reduced and unstressed syllables were more frequent than
those between stressed and unstressed syllables.
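
The tabulation behind such a confusion matrix can be sketched in a few lines
of Python. The judgment lists below are invented purely for illustration and
are not the listeners' data; the function names are likewise assumptions made
for this example.

```python
from collections import Counter

LEVELS = ("stressed", "unstressed", "reduced")

def confusion_matrix(trial1, trial2):
    """Rows index the first-trial judgment, columns the second-trial one."""
    counts = Counter(zip(trial1, trial2))
    return [[counts[(a, b)] for b in LEVELS] for a in LEVELS]

def diagonal_fraction(matrix):
    """Fraction of syllables judged the same way on both trials."""
    total = sum(sum(row) for row in matrix)
    agree = sum(matrix[i][i] for i in range(len(matrix)))
    return agree / total

# Hypothetical judgments for five syllables on two trials.
t1 = ["stressed", "unstressed", "reduced", "unstressed", "stressed"]
t2 = ["stressed", "reduced", "reduced", "unstressed", "stressed"]
m = confusion_matrix(t1, t2)
print(m)                      # [[2, 0, 0], [0, 1, 1], [0, 0, 1]]
print(diagonal_fraction(m))   # 0.8
```

The single off-diagonal entry in this toy case is an unstressed/reduced
confusion, the kind reported above as the most frequent.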

Comparisons of the perceptions of one listener (WAL) with the others
are shown in Figure 2, for the first trials by each listener. Listener
TES, with his different strategy, perceived far more reduced syllables,
and far fewer stressed syllables, than the other two listeners.

Shown in Figures 3 (for talker ASH) and 4 (for talker GWH) are the
syllable-by-syllable perceptions of stress levels. The majority decisions
of all three trials by listener WAL were combined with those of the first
trials by listeners MFM and TES, to yield overall impressions of the
relative stressedness of each syllable. Plotted for each of the syllables in
the Rainbow Script are the number of stressed judgments minus the number of
reduced judgments, for the three listeners. Unstressed judgments were
assigned values of zero. Cases where the reduced judgment of listener TES
cancelled a stressed judgment by another listener are shown by double-ended
arrows. For example, in the syllable sun, all three listeners
heard it as stressed, so that a value of +3 was plotted; if two perceived
a syllable as reduced and the other perceived it as unstressed (such as
with -sion in the second sentence), a value of -2 resulted. The syllables
which were most definitely stressed (i.e., perceived by all listeners as
stressed) thus were at the top of the scale; those definitely perceived
as reduced were at the bottom of the scale. (Also shown on Figures 3 and
4 are lines separating grammatical constituents, and boxes and circles
surrounding various stressed syllables, to be explained in sections 3 and
4.)
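
The stress score defined above is simple enough to state as a short Python
function. The function name and the string labels are choices made for this
illustration; the three cases checked are the ones described in the text.

```python
def stress_score(judgments):
    """Number of 'stressed' judgments minus number of 'reduced' ones.
    'unstressed' judgments contribute zero."""
    stressed = sum(j == "stressed" for j in judgments)
    reduced = sum(j == "reduced" for j in judgments)
    return stressed - reduced

# All three listeners hear the syllable as stressed (as with "sun").
print(stress_score(["stressed", "stressed", "stressed"]))    # 3
# Two hear it as reduced, one as unstressed (as with "-sion").
print(stress_score(["reduced", "reduced", "unstressed"]))    # -2
# A reduced judgment cancels one of two stressed judgments.
print(stress_score(["stressed", "stressed", "reduced"]))     # 1
```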






3. Boundary Detection Results

A hand analysis was done on the F0 contours of the Rainbow Script,
strictly following the algorithm for detecting boundaries between major
syntactic constituents, but incorporating a few slight refinements for
eliminating some false boundary detections. The detected boundaries are
marked in Figures 3 and 4 by lines between the printed syllables they
separate. As might be expected from previous studies (Lea, 1972), most (86%
for ASH, 92% for GWH*) of all boundaries between major syntactic
constituents, as predicted by an independent syntactic analysis, were correctly
detected (but not located). Six boundaries between minor syntactic
constituents were also detected in the data for talker GWH. One "false"
(syntactically unrelated) boundary detection was obtained at a consonant-vowel
transition for ASH, and two for GWH.

Figure 5 shows how many stressed syllables occurred in the detected
constituents, for each talker. In only two of the detected constituents,
for each talker, did no stressed syllable occur. These resulted from
improperly placed or "false" boundaries. On the other hand, well over half of
the constituents had exactly one stressed syllable in them. In a sense, the
constituent boundary program may then be said to be detecting some of the
stressed syllables (but not all). If each constituent had exactly one
lexical word with a major-stressed syllable within it (Chomsky, 1965; Emonds,
1970), we might expect constituent detections to be thus closely associated
with the presence of stressed syllables.
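
The tally plotted in Figure 5 can be sketched as follows in Python. The
stress scores and boundary positions used here are invented for illustration;
only the criterion SS = +2 or +3 for a stressed syllable comes from the text,
and the function name and input layout are assumptions.

```python
from collections import Counter

def stressed_per_constituent(scores, boundaries):
    """scores: stress score (SS) of each syllable, in order.
    boundaries: syllable indices at which a new constituent starts.
    Returns a Counter mapping 'stressed syllables in a constituent'
    to the number of constituents with that count."""
    edges = [0] + sorted(boundaries) + [len(scores)]
    histogram = Counter()
    for start, end in zip(edges, edges[1:]):
        n_stressed = sum(1 for s in scores[start:end] if s >= 2)
        histogram[n_stressed] += 1
    return histogram

# Hypothetical scores for ten syllables split into four constituents.
scores = [0, 3, -1, 2, 0, 0, -2, 3, 2, 1]
print(stressed_per_constituent(scores, [3, 5, 7]))
```

In this toy input the constituent with no stressed syllable plays the role
of one produced by a "false" boundary.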

It would appear that the boundary detection program could be used to
detect many stressed syllables. To locate the stressed syllable or syllables
within each constituent, further techniques were needed.

4. Stressed Syllable Location

Many studies have shown that peak F0, local increases in F0 (Bolinger,
1958), and energy integrals (Medress, Skinner, and Anderson, 1971) are
among the best acoustic correlates of stress, at least in isolated words.
Intonation studies (Armstrong and Ward, 1926) have shown that, in connected
texts, F0 peaks near the first stressed syllable (the so-called "HEAD") of
each breath group, and falls gradually until the last stressed syllable,
after which may occur the rapid fall of an utterance-final "TUNE I" contour
or the rise in F0 at the end of "TUNE II" contours (which mark "incompletion").

* These scores neglect Noun Phrase - Verbal boundaries, which previous studies
(Lea, 1971; 1972) had shown are not reliably manifested.

These concepts were incorporated in an algorithm for stressed syllable
location. In the algorithm, it was assumed that each major constituent
would have a TUNE I or II contour, that the F0 peak would be near the first
stressed syllable in the constituent, and that the first stressed syllable
("HEAD") could be located by high-energy and rising-F0 portions (bounded
by dips in speech intensity at syllable limits). Other stressed syllables in
the constituent are assumed to be manifested by deviations in F0 above an
archetype line (a straight line on a semi-log plot) from the peak to the F0
value at the end of the constituent. Again, such stressed syllables are
delimited by decreases in energy.
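
A minimal Python sketch of this idea follows. It makes several simplifying
assumptions that are not taken from the paper: the high-energy chunks are
supplied already delimited (the energy-dip analysis is not modelled), the HEAD
is taken to be the chunk that contains the F0 peak, every frame is voiced, and
a deviation counts only if it exceeds the archetype line by a hypothetical
fractional `margin`.

```python
import math

def locate_stressed(f0, segments, margin=0.03):
    """f0: F0 values (Hz) for consecutive frames of one constituent.
    segments: (start, end) frame-index pairs, end exclusive, for
    high-energy chunks assumed already delimited by energy dips.
    Returns (head, others) as indices into `segments`."""
    peak = max(range(len(f0)), key=lambda i: f0[i])
    last = len(f0) - 1
    # Simplification: the HEAD is the chunk containing the F0 peak.
    head = next(k for k, (s, e) in enumerate(segments) if s <= peak < e)

    def archetype(i):
        # Straight line in log F0 (semi-log plot) from the peak to the end.
        if last == peak:
            return f0[peak]
        frac = (i - peak) / (last - peak)
        log_f = math.log(f0[peak]) + frac * (math.log(f0[last])
                                             - math.log(f0[peak]))
        return math.exp(log_f)

    others = []
    for k, (s, e) in enumerate(segments):
        if k <= head:
            continue
        # A later chunk is flagged if F0 rises above the archetype line.
        if any(f0[i] > archetype(i) * (1 + margin) for i in range(s, e)):
            others.append(k)
    return head, others

# Hypothetical constituent: peak in the first chunk, a local F0 rise in
# the third chunk, and a final fall to 100 Hz.
f0 = [110, 125, 140, 132, 124, 117, 126, 112, 105, 100]
chunks = [(0, 3), (3, 6), (6, 8), (8, 10)]
print(locate_stressed(f0, chunks))   # (0, [2])
```

Working in log F0 makes the archetype a straight line on a semi-log plot, as
the text specifies, so that equal frame steps correspond to equal percentage
falls.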

The results of such algorithmic locations of stressed syllables were
compared with the perceptions of stress. For talker ASH, 29 of the 33
detected HEADS had been perceived as stressed by two or three listeners
(that is, had a Stress Score SS = +2 or +3), and two other detected HEADS
were perceived as stressed by two listeners, yet reduced by listener TES.
One false HEAD, due to a false boundary detection, was not perceived as
stressed by any listener. Ten other syllables with SS = +2 or +3 were also
found by the stressed syllable locator program, while eight were missed.
Thus, for talker ASH, 83% of all syllables with SS = +2 or +3 were located,
while three syllables (7% of all located syllables) perceived as stressed
by two listeners, but reduced by the other, were detected, and five
located syllables (10% of all located syllables) were not perceived as
stressed by at least two listeners (SS ≤ +1).
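
The 83% figure for talker ASH follows directly from the counts given in this
paragraph, as the short Python check below shows. The variable names are
introduced here for illustration; the counts are those stated in the text.

```python
heads_hit = 29    # detected HEADS with SS = +2 or +3
others_hit = 10   # other located syllables with SS = +2 or +3
missed = 8        # syllables with SS = +2 or +3 that were not located

located = heads_hit + others_hit
rate = located / (located + missed)
print(f"{located} of {located + missed} located ({rate:.0%})")
# 39 of 47 located (83%)
```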

For talker GWH, 93% of all stressed (SS = +2 or +3) syllables were
located by the algorithm, with one miss on a syllable of SS = +2, plus six
false alarms (with no more than one listener perceiving the syllable as
stressed). Three syllables perceived as stressed by two listeners, but
reduced by TES, were also located.

Figures 3 and 4 illustrate these results in more detail. The stress
score for each HEAD which was correctly detected has been boxed in (□),
while every other (non-HEAD) syllable located by the algorithm is marked by
a triangle (△) around the stress score for that syllable. It is important to
realize that the stressed syllables enclosed by the squares and triangles in
Figures 3 and 4 were not always the only data included within the high-energy
speech portions as located by the algorithm. Extended voiced sequences, and
especially sonorant sequences, may have no significant energy dips, so that
sequences such as -orizon and boiling may be included in the located "stressed
syllables". In fact, three cases (long round, one end, and man looks) are
shown in Figure 4 where, because of no substantial energy dips between
syllables, more than one stressed syllable was included in the HEAD. Such
syllables could be considered located, even though they are not separated from
a neighboring stressed syllable, since they still would be included in the data
processed by the strategy which estimates distinctive features within the
located stressed "syllables".

Thus, the large majority of stressed syllables and major syntactic
boundaries were correctly established by the two algorithms based on prosodic
patterns.

5. Applications to Distinctive Features Estimation

A pilot study of the differences in the effectiveness of distinctive
features estimation in stressed (SS = +2 or +3) versus unstressed (SS = -1,
0 or +1) syllables was undertaken, for the data of talker GWH. For each of
the 40 stressed and 23 unstressed syllables (excluding ones with the
diphthongs aI or oI), the center position of the energy maximum was
determined. Then a program for identifying whether a vowel was front or back
was applied to the three time segments centered at this position, and the
majority vote of the three frame decisions was taken as the front/back
identity of the vowel. We may define the error rate as the percentage of such
majority decisions that were front for phonemically back vowels, or vice
versa. The error rate was 22% for the unstressed vowels and 8% for the
stressed ones, indicating that front/back decisions were more likely to be in
error in unstressed than in stressed vowels.
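
The majority vote and the error rate defined above can be written out as a
short Python sketch. The frame decisions and phonemic labels below are
invented for illustration and are not the pilot-study data; the front/back
classifier itself is not modelled, and the function names are assumptions.

```python
from collections import Counter

def majority_vote(frame_decisions):
    """Most frequent label among the frame decisions for one vowel.
    With three frames and two labels a majority always exists."""
    return Counter(frame_decisions).most_common(1)[0][0]

def error_rate(decisions, truth):
    """Fraction of vowels whose voted label differs from the phonemic one."""
    errors = sum(d != t for d, t in zip(decisions, truth))
    return errors / len(truth)

# Hypothetical frame-level outputs for four vowels, three frames each.
frames = [
    ["front", "front", "back"],
    ["back", "back", "back"],
    ["front", "back", "back"],
    ["front", "front", "front"],
]
phonemic = ["front", "back", "front", "front"]

votes = [majority_vote(f) for f in frames]
print(votes)                        # ['front', 'back', 'back', 'front']
print(error_rate(votes, phonemic))  # 0.25
```

Voting over three frames means that a single deviant frame, as in the first
vowel above, does not change the decision.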

This is only one indication of the differences expected in distinctive
features estimation within stressed versus unstressed syllables. Success in
identifying obstruents is expected to be better within stressed syllables,
and other vowel features (such as the distinction between high and low vowels)
will undoubtedly be affected. The strategy of distinctive features
estimation is thus expected to be different, depending upon whether the
syllable is stressed or unstressed.



The better strategies for distinctive features estimation are expected to 
be substantially aided by the automatic detection of syntactic boundaries and 
the location of stressed syllables, as described in this paper. In addition, 
boundaries and stressed syllable information may be useful in syntactic pars- 
ing and other aspects of speech recognition. 



ACKNOWLEDGEMENT 

The authors are indebted to George W. Hughes and Kung-Pu Li of Purdue
University for providing the speech data analyzed in this research. The
procedure for perceptual judgments of stress is a modification of that used
by Hughes, Li, and Snow (1972).



REFERENCES

ARMSTRONG, L. E. and WARD, I. C. (1926), Handbook of English Intonation.
Cambridge: Heffer (2nd Edit.).

BOLINGER, D. (1958), A Theory of Pitch Accent in English, Word, vol. 14,
p. 109.

CHOMSKY, N. (1965), Aspects of the Theory of Syntax. Cambridge, Mass.:
M.I.T. Press, Chapter 2.

EMONDS, J. E. (1970), Root and Structure Preserving Transformations, Ph.D.
Thesis, Linguistics Dept., M.I.T.

FAIRBANKS, G. (1940), Voice and Articulation Drillbook. New York: Harper
and Row.

HUGHES, G. W., LI, K.-P., and SNOW, T. B. (1972), An Approach to Research
on Word Spotting in Continuous Speech, Proc. 1972 Conf. on Speech
Communication and Processing, Newton, Mass., pp. 109-112.

LEA, W. A. (1971), Automatic Detection of Constituent Boundaries in Spoken
English, J. Acoust. Soc. Amer., vol. 50, p. 116(A).

LEA, W. A. (1972), Intonational Cues to the Constituent Structure and
Phonemics of Spoken English, Ph.D. Thesis, School of E.E., Purdue
University. Portions of that research appeared in "An Approach to Syntactic
Recognition without Phonemics", Proc. 1972 Conf. on Speech Communication and
Processing, Newton, Mass., pp. 198-201.

LEA, W. A., MEDRESS, M. F., and SKINNER, T. E. (1972), Prosodic Aids to Speech
Recognition, Semiannual Technical Report, ARPA Contract DAHC15-72-C-0138,
Univac Report No. PX 7940, October, 1972.

MEDRESS, M. F. (1972), A Procedure for the Machine Recognition of Speech,
Proc. 1972 Conf. on Speech Communication and Processing, Newton, Mass.,
pp. 113-116.

MEDRESS, M. F., SKINNER, T. E., and ANDERSON, D. E. (1971), Acoustic
Correlates of Word Stress, Presented to 82nd Meeting, Acoustical Society of
America, Denver, Colorado, October 20, (Paper K3).



[Figure 1 panels: confusion matrices of stressed, unstressed, and reduced
judgments; axis labels "Trial 1 by listener WAL", "Trial 1 by listener MFM",
"Trial 1 by listener TES".]

Figure 1. Perceived stress levels from repetition to repetition, with results
for both talkers pooled. (a) Listener WAL; (b) Listener MFM; (c) Listener TES.



[Figure 2 panels: confusion matrices of stress judgments; axis label
"Judgments by listener WAL".]

Figure 2. Perceived stress levels for one listener versus the other listeners,
with results for both talkers pooled. (a) Listener WAL versus listener MFM;
(b) listener WAL versus listener TES.



[Figure 3 (talker ASH) and Figure 4 (talker GWH): plots of syllable-by-syllable
stress scores for the Rainbow Script.]



[Figure 5 panels: (a) Talker ASH; (b) Talker GWH. Horizontal axis: number of
stressed syllables in constituent (0 to 6).]

Figure 5. Frequencies of Occurrences of Perceived Stressed (SS = +2 or +3)
Syllables within Detected Constituents of the Rainbow Script.



